The analysis of metagenomic data involves assigning each individual sequence to a bacterial
taxon or taxa. Bacteria are classified into hierarchical taxonomic groups in descending
order (kingdom, phylum, class, order, family, genus, species, and subspecies). The analysis
also includes construction of a phylogenetic tree for the bacterial community, quantification
of bacterial presence in the sample (abundance), and measurement of the microbial diversity
within an individual sample and across samples. A number of bioinformatics tools are
available for analyzing metagenomic data.
In amplicon-based metagenomics, the targeted region is a marker gene, part of a gene, or
any genomic region appropriate for microbial identification. The ideal target region
contains highly conserved sequences, which can serve as primer sites, alongside less
conserved sequences that distinguish taxa; it must be present in all target species and
have reference sequences available in the sequence databases. The 16S rRNA gene is one of
the best candidate marker genes. It codes for a component of the 30S small subunit of the
bacterial ribosome. It is around 1500 bp long and consists of nine conserved regions
interspersed with hypervariable regions. The conserved regions are labeled C1, C2, …, C9
and the variable regions are labeled V1, V2, …, V9. PCR primers are designed from the
conserved regions flanking a variable region so that the species-specific targeted region
is amplified and enriched by PCR. The amplicons are then sequenced and analyzed.
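The primer design described above can be illustrated with a small in-silico sketch. It assumes exact primer matches and uses a toy template and toy primer sequences invented for illustration (they are not real 16S rRNA primers); the function names are likewise hypothetical. The forward primer is located in a conserved region, the reverse primer binds the opposite strand in the next conserved region, and the amplicon is the stretch between them, including the variable region.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def extract_amplicon(template: str, fwd_primer: str, rev_primer: str):
    """Return the region amplified by a primer pair (primers included).

    Assumes exact matches only; returns None if either primer site is missing.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the template
    # strand we search for its reverse complement downstream of the forward site.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Toy sequences (hypothetical, for illustration only).
conserved_left = "AGCCTGAT"    # stands in for a conserved region
variable = "TTGACCA"           # stands in for a hypervariable region
conserved_right = "GGATCCAA"   # stands in for the next conserved region
template = "AAAA" + conserved_left + variable + conserved_right + "TTTT"

fwd_primer = conserved_left
rev_primer = reverse_complement(conserved_right)

amplicon = extract_amplicon(template, fwd_primer, rev_primer)
print(amplicon)
```

Real primers often contain degenerate bases and tolerate mismatches, which this exact-match sketch does not handle.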
Several samples are usually sequenced in a single run using a multiplexing approach, in
which unique barcode sequences are ligated to the DNA of each sample in the library
preparation step. After sequencing, the reads are demultiplexed, that is, the reads of
each sample are separated into their own FASTQ files before analysis. Because
amplicon-based metagenomics depends on a targeted region of the genome, it has lower
resolution than shotgun whole-genome sequencing but is less expensive.
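As a rough illustration of demultiplexing, the sketch below groups FASTQ reads by sample. It assumes, for simplicity, that each read begins with its sample barcode (inline barcodes) and that barcodes match exactly; the barcodes, sample names, reads, and function names are all hypothetical. In practice, dedicated tools handle barcode mismatches and separate index reads.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual


def demultiplex(handle, barcodes):
    """Group reads by sample using an exact barcode match at the read start.

    The barcode is trimmed from both the sequence and the quality string.
    Reads without a known barcode are collected under 'unassigned'.
    """
    bins = defaultdict(list)
    for header, seq, qual in read_fastq(handle):
        sample = "unassigned"
        for barcode, name in barcodes.items():
            if seq.startswith(barcode):
                sample = name
                seq, qual = seq[len(barcode):], qual[len(barcode):]
                break
        bins[sample].append((header, seq, qual))
    return bins


# Toy multiplexed FASTQ content (hypothetical reads and barcodes).
fastq_text = """@read1
ACGTTTGACCA
+
IIIIIIIIIII
@read2
TGCAGGATCCA
+
IIIIIIIIIII
@read3
GGGGAAAATTT
+
IIIIIIIIIII
"""
barcodes = {"ACGT": "sampleA", "TGCA": "sampleB"}

bins = demultiplex(io.StringIO(fastq_text), barcodes)
for sample, reads in bins.items():
    print(sample, [seq for _, seq, _ in reads])
```

Each entry of `bins` could then be written to its own FASTQ file, one per sample.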
7.2 ANALYSIS WORKFLOW
In the following, we discuss the steps of the amplicon-based metagenomic data analysis
workflow: raw data preprocessing, read clustering, denoising (error removal), taxonomic
group assignment, construction of a phylogenetic tree, and diversity analysis.
7.2.1 Raw Data Preprocessing
After sequencing the targeted marker, the raw sequence data are obtained in FASTA or FASTQ
format. For FASTA files produced by Sanger sequencing, the per-base quality scores may be
provided in a separate file, whereas the FASTQ format stores the base quality scores in
the same file as the sequences. The base quality scores indicate the confidence of each
base call and enable us to assess the sequence reads and determine whether they require
preprocessing before the analysis. The quality control step should not be taken lightly:
errors in base calls may arise during library preparation or sequencing. Low-quality reads
are usually filtered or truncated so that the errors do not affect the final results. The
preprocessing of the raw sequence data may also include demultiplexing if multiple samples
are sequenced in a single run. The demultiplexing step depends on